Feat: restructuring of communication methods (and buffered communication for CUDA-Aware MPI) #167
Conversation
A new `DistributedMixIn` class is created with the aim of simplifying and unifying all communication calls in both `DistributedArray` and the operators (further hiding away all implementation details).
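As a hypothetical sketch of the mixin idea (the method names and classes below are illustrative, not the PR's actual code): communication helpers live in one class and take the communicator and buffers as explicit arguments, so both the array class and the operators can inherit them.

```python
import numpy as np

class DistributedMixIn:
    """Shared communication helpers (illustrative sketch only)."""

    @staticmethod
    def _allreduce(base_comm, send_buf, recv_buf=None):
        # Buffered (uppercase) mpi4py-style call avoids pickling; fall back
        # to the object (lowercase) API when no receive buffer is provided.
        if recv_buf is None:
            return base_comm.allreduce(send_buf)
        base_comm.Allreduce(send_buf, recv_buf)
        return recv_buf

class DistributedArray(DistributedMixIn):
    def __init__(self, base_comm, local_array):
        self.base_comm = base_comm
        self.local_array = np.asarray(local_array)

    def sum_across_ranks(self):
        # Pass the communicator explicitly: the mixin never touches self.*
        return self._allreduce(self.base_comm, self.local_array.sum())
```

Because the helpers receive all inputs as arguments, an operator class can reuse `_allreduce` without carrying a `DistributedArray`'s state.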
@tharittk great start! Regarding the setup, I completely agree with the need to change the installation process for CUDA-Aware MPI; I have personally so far mostly relied on conda to install. Regarding the code, as I briefly mentioned offline, whilst I think this is the right way to go,
I am starting to feel that the number of branches in the code is growing and it is about time to put it all in one place... what I am mostly concerned about is that this kind of branching will not only be present in
@astroC86 we have also talked a bit about this in the context of your
@tharittk I worked a bit more on this, but there is still quite a bit to do (added to the TODO list in the PR main comment)... Also, I am not really sure why some tests are failing on some specific combinations of python/mpi/nranks but not on others... I have not investigated yet...
mrava87 left a comment
@tharittk I left a few comments. `DistributedMixIn` is partially unfinished, as we need to make sure all methods get passed the same inputs and don't rely on `self.*`, so that they can be used by both `DistributedArray` and the operators.
Also, running tests locally, NumPy+MPI and CuPy+MPI pass, but for CuPy+NCCL I get a seg fault at tests_nccl/test_solver_nccl.py::test_cgls_broadcastdata_nccl[par0] (Fatal Python error: Segmentation fault). Same for you?
@rohanbabbar04 I remember we discussed this a long time ago and you were actually the first to suggest using mixins... feel free to take a look and provide any feedback 😄
I don't have the problem with CuPy + NCCL; I still get 309 tests passing. This is my sequence of commands: I switch to
Mmh interesting... I installed... I fixed that 😄 So I can now run locally the following with success: It seems that the installation I thought had CUDA-aware MPI did not, and things worked because I had PYLOPS_MPI_CUDA_AWARE=0 set... since you have CUDA-aware MPI, can you please test the entire suite? Apart from this (which we definitely need to try to put on a CI, as it is just too many things to run locally now...), once you handle my code comments above we should be almost ready to merge 🚀
The changes look good to me :)
I am also going to test it, once the changes are done, on a multi-GPU system on Chameleon Cloud.
Also, I agree we should set up CI for this. Maybe I can look into it.
Thanks @rohanbabbar04! For the CI, it is not so easy, as GitHub does not offer GPU machines. In PyLops, @hongyx11 set it up using his university cluster as a self-hosted runner (https://github.com/PyLops/pylops/blob/dev/.github/workflows/buildcupy.yaml). He has been looking to extend this to a multi-GPU setting for pylops-mpi, but I am not sure how far he has gone 😄
@tharittk I have finalized removing all ... But I am still facing an issue with CUDA-aware MPI tests (I finally got it to work!): the last 3 tests in
@mrava87 I assume you mean uncommenting and testing this line: pylops-mpi/tests/test_stackedarray.py, line 142 at 489dfa6.
I did. I have 12 tests passing and nothing fails.
@tharittk No, I had not even noticed that 😉 What I meant was that I could only get the tests to pass if I did: basically comment out the 3 ... Interesting that everything works fine for you. Btw, what is the ... Not sure why there are inconsistencies between versions of MPI, but I am not 100% surprised. Since I do not have pre-installed CUDA-aware MPI, my installation process to get things working (except for the tests with those 3 parameters) was:
@tharittk, so if you confirm all tests run for you, and you are also comfortable with the current version of the code, I am happy for this to go ahead and be merged... I can investigate on my end why I have this issue, but it should not be a blocker 😄
The ... For this,
...
I think it is quite different from what I did to make it work on the NCSA cluster, which kind of forces the installation of ... I'm pretty comfortable with the current code. In the meantime, I want to go check out DeltaAI and see if it offers CUDA-aware MPI. If so, I want to try installing pylops-mpi on that system and run the tests to see if everything looks ok.
Alright, makes sense (about the -s). And yes, I know your current setup is quite different; I just wanted to mention what I did to get CUDA-aware MPI to work for me, for the sake of the record... since I don't have a pre-installed MPI I can rely on, I decided (with the help of ChatGPT) to go the conda way, so I don't need to rely on any system installation of CUDA either... it makes sense that you needed to reinstall with pip, as you had already installed mpi4py with conda, and to point to MPICC; for me, since I also install MPI directly in my environment, it seems to be picked up automatically when installing mpi4py 😀 Happy to either wait until you test it on Delta, or to merge if you prefer (we can always handle any issue that arises, if it arises, in a separate PR).
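For the record, a hedged sketch of that conda-based route (package names and versions below are illustrative, not the exact commands used): install OpenMPI into the environment itself from conda-forge, then build mpi4py from source with pip so it links against the environment's MPI.

```shell
# Illustrative only: create an env that carries its own MPI, so the pip
# build of mpi4py picks up the conda-installed mpicc automatically.
conda create -n pylops_mpi -c conda-forge python=3.11 openmpi numpy
conda activate pylops_mpi
# Build mpi4py from source against the environment's MPI
# (skip any conda mpi4py so it is not pre-installed as a binary).
pip install --no-binary=mpi4py mpi4py
```

The key difference from a cluster setup is that no `module load` or system compiler path is needed: the MPI wrappers sit on the environment's PATH.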
Yes, sure, we can handle any issue that may arise separately. I have been trying to get onto a DeltaAI compute node for some time now (they only have GH200). It is a challenge to get an allocation, in particular because I need an interactive shell to test the installation.
Objective
This PR has two main goals:
- Move all communication methods of `DistributedArray` and the various linear operators to a single common place, namely the `DistributedMixIn` class, whose methods are used by both `DistributedArray` and the linear operators
- A new environment variable `PYLOPS_MPI_CUDA_AWARE` is introduced (defaults to 1, but can be set to 0 to force object communications)

CUDA-Aware MPI
In order to have a CUDA-aware mpi4py installation, mpi4py must be built against a CUDA-aware MPI installation. Since conda installations of mpi4py do not ship with a CUDA-aware MPI, it is therefore required to use pip for installing mpi4py. In the case of NCSA Delta, I create a new conda environment and do

```
module load openmpi/5.0.5+cuda
MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py
```

where `--force-reinstall` is only needed because `mpi4py` is already installed as part of the conda environment creation. And to run the tests (assuming you're on the compute node already):
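One way to sanity-check the result (assuming OpenMPI, whose `ompi_info` tool exposes build-time MCA parameters) is to query whether the library was built with CUDA support:

```shell
# Prints mca:mpi:base:param:mpi_built_with_cuda_support:value:true when
# OpenMPI is CUDA-aware; guarded so the snippet degrades gracefully
# when OpenMPI is not on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info --parsable --all | grep -i "mpi_built_with_cuda_support:value" || true
else
  echo "ompi_info not found (is the OpenMPI module loaded?)"
fi
```

This avoids discovering a non-CUDA-aware build only when a GPU-buffer test segfaults.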
To Do
- The `mpi_allgather` method uses the `_prepare_allgather_inputs` method to prepare inputs such that they are all of the same size (via zero-padding). Whilst this is strictly needed for NCCL, we should instead consider leveraging MPI `Allgatherv` to avoid extra padding and cutting of arrays - Use `Allgatherv` in `mpi_allgather` #169
- Modify `MatrixMult` to remove any direct call to mpi4py communication methods - Use new unified communication methods in `MatrixMult` #170
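To illustrate the padding concern (a sketch, not the actual `_prepare_allgather_inputs` implementation): a plain `Allgather` requires every rank to contribute a buffer of the same size, so shorter local arrays are zero-padded and the concatenated result must be trimmed afterwards, whereas `Allgatherv` accepts per-rank counts and skips both steps. The simulation below uses no MPI, only NumPy.

```python
import numpy as np

def pad_for_allgather(local, max_len):
    """Zero-pad a local array to the common size required by Allgather
    (sketch of what a _prepare_allgather_inputs-style helper does)."""
    padded = np.zeros(max_len, dtype=local.dtype)
    padded[: len(local)] = local
    return padded

def trim_after_allgather(gathered, counts, max_len):
    """Cut the zero-padding back out of the concatenated Allgather result."""
    return np.concatenate(
        [gathered[i * max_len : i * max_len + n] for i, n in enumerate(counts)]
    )

# Simulated 2-rank gather with unequal local sizes:
counts = [3, 2]
max_len = max(counts)
locals_ = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
gathered = np.concatenate([pad_for_allgather(a, max_len) for a in locals_])
result = trim_after_allgather(gathered, counts, max_len)
# With mpi4py's Allgatherv one would instead pass per-rank counts,
# e.g. comm.Allgatherv(local, [recvbuf, counts]), avoiding pad/trim entirely.
```

NCCL's `allGather` has no variable-count variant, which is why the padding path stays necessary there.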